Recent large-scale image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a very simple text prompt. Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by questioning the need for real images when training models for ImageNet classification. More precisely, provided only with the class names that have been used to build the dataset, we explore the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure how useful they are for training classification models from scratch. We show that, with minimal and class-agnostic prompt engineering, those ImageNet clones, which we denote as ImageNet-SD, are able to close a large part of the gap between models produced by synthetic images and models trained with real images on the several standard classification benchmarks that we consider in this study. More importantly, we show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data for transfer.
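As a rough illustration of what class-agnostic prompt engineering from class names alone could look like, the sketch below builds (label, prompt) pairs that would then be passed to a text-to-image model. The templates, the function name `build_prompts`, and the sampling scheme are assumptions made for illustration; the abstract does not specify the actual prompts.

```python
import random

# Hypothetical class-agnostic templates: the same set is used for every class.
# The abstract does not list the prompts actually used.
TEMPLATES = [
    "{name}",
    "a photo of a {name}",
    "a {name} in a natural scene",
]


def build_prompts(class_names, images_per_class, seed=0):
    """Return (label, prompt) pairs built only from the class names."""
    rng = random.Random(seed)
    prompts = []
    for label, name in enumerate(class_names):
        for _ in range(images_per_class):
            template = rng.choice(TEMPLATES)
            prompts.append((label, template.format(name=name)))
    return prompts


# Two example class names; each prompt would be sent to the image generator
# and the resulting image stored with its integer label.
CLASS_NAMES = ["tench", "goldfish"]
prompts = build_prompts(CLASS_NAMES, images_per_class=3)
```

Because the templates never depend on the class, the same procedure scales to any label set without per-class tuning.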
Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning -- a setting where not all the data samples are labeled. An underlying issue in this scenario is the model forgetting representations of unlabeled data and overfitting the labeled ones. We leverage the power of nearest-neighbor classifiers to non-linearly partition the feature space and learn a strong representation for the current task, as well as distill relevant information from previous tasks. We perform a thorough experimental evaluation and show that our method outperforms all the existing approaches by large margins, setting a strong state of the art on the continual semi-supervised learning paradigm. For example, on CIFAR100 we surpass several others even when using at least 30 times less supervision (0.8% vs. 25% of annotations).
We propose a novel antialiasing method to increase shift invariance in convolutional neural networks (CNNs). More precisely, we replace the conventional combination "real-valued convolutions + max pooling" ($\mathbb R$Max) by "complex-valued convolutions + modulus" ($\mathbb C$Mod), which produces stable feature representations for band-pass filters with well-defined orientations. In a recent work, we proved that, for such filters, the two operators yield similar outputs. Therefore, $\mathbb C$Mod can be viewed as a stable alternative to $\mathbb R$Max. To separate band-pass filters from other freely-trained kernels, in this paper, we designed a "twin" architecture based on the dual-tree complex wavelet packet transform, which generates outputs similar to those of standard CNNs with fewer trainable parameters. In addition to improving stability to small shifts, our experiments on AlexNet and ResNet showed increased prediction accuracy on natural image datasets such as ImageNet and CIFAR10. Furthermore, our approach outperformed recent antialiasing methods based on low-pass filtering by preserving high-frequency information, while reducing memory usage.
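The contrast between the two operators can be sketched on a 1D toy signal. The example below is an illustration under stated assumptions, not the paper's architecture: it uses a single hand-built complex Gabor filter instead of the dual-tree complex wavelet packet transform, and the signal, filter width, and frequency are arbitrary choices. It compares how much each operator's output changes when the input is shifted by one sample.

```python
import cmath
import math


def circular_conv(signal, taps, half_width):
    """Circular convolution of a real signal with a complex filter."""
    n = len(signal)
    return [
        sum(taps[m + half_width] * signal[(i - m) % n]
            for m in range(-half_width, half_width + 1))
        for i in range(n)
    ]


def rmax(response, pool=2):
    """Real part of the response followed by max pooling (stride = pool)."""
    real = [z.real for z in response]
    return [max(real[i:i + pool]) for i in range(0, len(real), pool)]


def cmod(response, stride=2):
    """Modulus of the complex response, subsampled with the same stride."""
    return [abs(z) for z in response[::stride]]


def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


# Toy setup (assumed values): a cosine at the filter's centre frequency.
N, HALF, SIGMA = 64, 8, 3.0
OMEGA = 2 * math.pi * 12 / N
gabor = [math.exp(-m * m / (2 * SIGMA ** 2)) * cmath.exp(1j * OMEGA * m)
         for m in range(-HALF, HALF + 1)]
x = [math.cos(OMEGA * i) for i in range(N)]
x_shifted = x[-1:] + x[:-1]  # one-sample circular shift

y = circular_conv(x, gabor, HALF)
y_shifted = circular_conv(x_shifted, gabor, HALF)

rmax_diff = max_abs_diff(rmax(y), rmax(y_shifted))
cmod_diff = max_abs_diff(cmod(y), cmod(y_shifted))
print(f"RMax change under shift: {rmax_diff:.3f}")
print(f"CMod change under shift: {cmod_diff:.3f}")
```

In this toy case the modulus is nearly constant across positions, so subsampling it is insensitive to the shift, whereas the max-pooled real part depends on where the pooling windows fall relative to the oscillation.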
Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators. Although these operators provide flexibility to the model with their adjustable attention kernels, they suffer from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy of the ViT layers, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called lightweight structure-aware attention (LiSA), which has a better representation power with log-linear complexity. Our operator learns structural patterns by using a set of relative position embeddings (RPEs). To achieve log-linear complexity, the RPEs are approximated with fast Fourier transforms. Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform self-attention and other existing operators, achieving state-of-the-art results on ImageNet, and competitive results on other visual understanding benchmarks such as COCO and Something-Something-V2. The source code of our approach will be released online.
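To make the complexity argument concrete, the sketch below shows the general principle that a weighting depending only on relative position can be applied with fast Fourier transforms in O(n log n) rather than O(n^2). It is not the LiSA operator: the circular (wrap-around) boundary, the 1D sequence, the scalar kernel, and the function names are all simplifying assumptions.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def apply_direct(kernel, x):
    """O(n^2): output i sums x[j] weighted by the relative offset (i - j)."""
    n = len(x)
    return [sum(kernel[(i - j) % n] * x[j] for j in range(n))
            for i in range(n)]


def apply_fft(kernel, x):
    """O(n log n): same result via the circular convolution theorem."""
    n = len(x)
    spectrum = [a * b for a, b in zip(fft(kernel), fft(x))]
    return [z.real / n for z in fft(spectrum, inverse=True)]


# Hypothetical relative-position weights: offsets 0, +1 and -1 only.
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

direct = apply_direct(kernel, x)
fast = apply_fft(kernel, x)
```

Both functions compute the same output; only the cost differs, which is the property that makes long token sequences tractable.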
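We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at that task as well as at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization and maintaining its performance on the original task. Models trained with self-supervised learning (SSL) tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds. We enrich the common supervised training framework with two key components of recent SSL models: multi-scale crops for data augmentation and an expendable projector head. We replace the last layer of class weights with class prototypes computed on the fly using a memory bank. We show that these three improvements lead to a more favorable trade-off between the IN1K training task and 13 transfer tasks. Over all the explored configurations, we single out two models: T-Rex, which achieves a new state of the art for transfer learning and outperforms top methods such as DINO and PAWS on IN1K, and T-Rex*, which matches the highly optimized RSB-A1 model on IN1K while performing better on transfer tasks. Project page and pretrained models: https://europe.naverlabs.com/t-rex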
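The idea of replacing learned class weights with prototypes computed on the fly from a memory bank can be sketched as follows. This is a minimal illustration, not the authors' implementation: the FIFO bank, the per-class mean, the cosine similarity, the temperature value, and the class name `PrototypeClassifier` are all assumptions.

```python
import math
from collections import deque


class PrototypeClassifier:
    """Classifies by similarity to class prototypes built from a memory bank."""

    def __init__(self, num_classes, capacity, temperature=0.1):
        self.num_classes = num_classes
        self.bank = deque(maxlen=capacity)  # oldest entries are evicted first
        self.temperature = temperature

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(a * a for a in vector)) or 1.0
        return [a / norm for a in vector]

    def update(self, features, labels):
        """Store L2-normalized features with their labels."""
        for feature, label in zip(features, labels):
            self.bank.append((self._normalize(feature), label))

    def prototypes(self):
        """Per-class mean direction of the features currently in the bank."""
        dim = len(self.bank[0][0])
        sums = [[0.0] * dim for _ in range(self.num_classes)]
        for feature, label in self.bank:
            for d in range(dim):
                sums[label][d] += feature[d]
        return [self._normalize(s) for s in sums]

    def logits(self, feature):
        """Cosine similarity to each prototype, scaled by the temperature."""
        query = self._normalize(feature)
        return [sum(a * b for a, b in zip(query, proto)) / self.temperature
                for proto in self.prototypes()]


clf = PrototypeClassifier(num_classes=2, capacity=4)
clf.update([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [0, 0, 1])
scores = clf.logits([1.0, 0.05])
```

The resulting logits can be fed to a standard cross-entropy loss, so no class-weight matrix has to be trained in the final layer.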
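Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common top-view grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present "Lara", an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention in the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes.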
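Learning diverse skills by interacting with an environment, without any external supervision, is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or any domain knowledge. We use random walks to train a reachability network that predicts the similarity between two states. This reachability network is then used to build a goal memory containing past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network with goals sampled from the goal memory, and reward it using the reachability network and the goal memory. All the components are kept updated throughout training as the agent discovers and learns new goals. We apply our method to continuous control navigation and robotic manipulation tasks.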
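One way to picture how a reachability score could keep a goal memory diverse is sketched below. Everything here is an assumption made for illustration: the learned reachability network is replaced by a hand-written distance-based stand-in, and the admission rule (store an observation only if it is not already similar to a stored goal) and the threshold are hypothetical, since the abstract does not describe the actual criterion.

```python
import math
import random


def reachability(state_a, state_b, scale=1.0):
    """Hypothetical stand-in for the learned network: similarity in (0, 1]."""
    return math.exp(-math.dist(state_a, state_b) / scale)


class GoalMemory:
    """Stores observations that are mutually dissimilar under reachability."""

    def __init__(self, threshold=0.5):
        self.goals = []
        self.threshold = threshold  # assumed novelty threshold

    def maybe_add(self, observation):
        """Add the observation only if no stored goal is already similar."""
        is_novel = all(reachability(observation, goal) < self.threshold
                       for goal in self.goals)
        if is_novel:
            self.goals.append(tuple(observation))
        return is_novel

    def sample(self, rng):
        """Draw a goal uniformly to condition the policy on."""
        return rng.choice(self.goals)


memory = GoalMemory(threshold=0.5)
added = [memory.maybe_add(obs) for obs in [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0)]]
goal = memory.sample(random.Random(0))
```

With this rule, near-duplicate observations are rejected, so uniformly sampled goals cover the visited states more evenly than sampling from the raw trajectory would.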
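Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that focus only on lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background, etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence audio-visual ASR transformer (Avatar), which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called Visspeech, which demonstrates the contribution of the visual modality under challenging audio conditions.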
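Self-supervised models have been shown to produce visual representations comparable to or better than those of their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a continual learning (CL) scenario, where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for CL by adding a predictor network that maps the current state of the representations to their past state. This enables us to devise a framework for continual self-supervised visual representation learning that (i) significantly improves the quality of the learned representations, (ii) is compatible with several state-of-the-art self-supervised objectives, and (iii) needs little to no hyperparameter tuning. We demonstrate the effectiveness of our approach by training six popular self-supervised models in various CL settings.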
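Pre-training on large-scale unlabeled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which therefore does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2, and Condensed Movies datasets.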
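The masking step described above can be sketched as a small data-preparation function. This is a minimal illustration under assumptions: the modality names, the `[MASK]` placeholder, the uniform choice of which modality to hide, and the token-list representation are all hypothetical, and the encoder and prediction loss are omitted.

```python
import random

MODALITIES = ("appearance", "audio", "speech")
MASK = "[MASK]"  # hypothetical placeholder token


def mask_one_modality(sample, rng):
    """Hide one entire modality; return the masked input and the target.

    `sample` maps each modality name to its list of tokens or features.
    The model would be trained to predict `target` from `masked`.
    """
    target_name = rng.choice(MODALITIES)
    masked = {
        name: [MASK] * len(tokens) if name == target_name else list(tokens)
        for name, tokens in sample.items()
    }
    return masked, target_name, list(sample[target_name])


sample = {
    "appearance": ["frame_0", "frame_1"],
    "audio": ["spec_0", "spec_1", "spec_2"],
    "speech": ["add", "the", "onions"],
}
masked, target_name, target = mask_one_modality(sample, random.Random(0))
```

Since any of the three modalities can be the one hidden, the encoder sees speech as an input in most training steps instead of only as a supervision signal.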